Lecture 25: Suffix Arrays
Parallel and Sequential Data Structures and Algorithms, 15-210 (Spring 2012)
Lectured by Kanat Tangwongsan, April 17, 2012
Lecture notes by Guy E. Blelloch, Margaret Reid-Miller, and Kanat Tangwongsan

Material in this lecture: the main theme of this lecture is suffix arrays.
- Definition: suffix arrays
- Sorting all the suffixes faster than O(n^2)

Here is a standard string-processing problem: find a pattern P in a big corpus of text S. For context, suppose we are given a DNA sequence S (e.g., S = AGTCG...), and we want to preprocess S so that afterwards we can locate all occurrences of a pattern P (|P| <= |S|) in S quickly, say in O(|P| log |S|) or O(|P| + log |S|) work. (This is actually better, per pattern, than KMP or similar string-searching algorithms, which require about O(|P| + |S|) work per search.)

Last week, we discussed dynamic programming, where the main idea was to identify redundant computation and reuse the results whenever possible. Today, we are going to see another application of this idea, disguised in the context of string algorithms.

1 Suffix Arrays

From searching for a string pattern, to finding the longest repeated substring, to computing maximal palindromes, suffix arrays and their close cousin, the suffix tree, have it covered. Unsurprisingly, they are heavily used in bioinformatics (finding patterns in protein/DNA sequences) and in data compression (the Burrows-Wheeler transform; see bzip2), to give just a few examples.

A suffix array is a data structure for efficiently answering queries on large strings. The suffix array of a string S is an array of pointers to the suffixes of S, sorted in lexicographical order of the suffixes. For concreteness, let suff_i(S) denote the suffix of S starting at position i. As an example, consider the string S = parallel. The suffixes of S are:

  i   suff_i(S)
  0   parallel
  1   arallel
  2   rallel
  3   allel
  4   llel
  5   lel
  6   el
  7   l
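To make the definition concrete, here is a minimal Python sketch (ours, not part of the original notes; the function name is illustrative) that builds a suffix array by brute force. The next section shows the same result worked out by hand.

  # Brute-force suffix array: sort the suffix start positions by the suffixes
  # they point to.  Simple but costly; Section 1.2 analyzes why.
  def naive_suffix_array(S):
      return sorted(range(len(S)), key=lambda i: S[i:])

  print(naive_suffix_array("parallel"))   # prints [3, 1, 6, 7, 5, 4, 0, 2]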

Sorting these suffixes, we get

  allel
  arallel
  el
  l
  lel
  llel
  parallel
  rallel

which results in the following suffix array:

  i       0  1  2  3  4  5  6  7
  SA[i]   3  1  6  7  5  4  0  2

Notice that in this example, the suffix array indicates that suff_3 < suff_1 < suff_6 < suff_7 < suff_5 < ..., where < denotes lexicographical ordering.

1.1 String Matching Using a Suffix Array

Let's go back to the problem posed at the beginning of the lecture. We'll pretend for a moment that we know how to construct suffix arrays. What can we do now? Remember that we wanted to find occurrences of a pattern P in S. It is not hard to see that the pattern P occurs at position i of S if and only if P is a prefix of the suffix suff_i(S). This means

  occurrences of P in S  =  all suffixes of S that have P as a prefix.

This observation allows us to transform the string-matching problem into a prefix-matching problem on all the suffixes. But we haven't actually made any progress yet. To proceed, we will make use of the sortedness of suffix arrays. Since suffix arrays are sorted, we can perform a binary search to locate an occurrence of the pattern. The number of comparisons required by a binary search is logarithmic in the size of the domain (here, the length of the suffix array, which has |S| entries). Each comparison is made between P and a suffix of S, which we know can be done in O(|P|) work. Therefore, for a total of O(|P| log |S|) work, we can locate an occurrence of P, or report that none exists.

Finding them all. If our goal is to find all occurrences of P, we need an additional observation, which follows from the sortedness of suffix arrays:

  Observation: For any prefix A, all suffixes of S that have prefix A are contiguous entries in the suffix array.

Therefore, if we are looking for the pattern P = l in the string S = parallel, we should be looking for suffixes that begin with l, and all of these occurrences are next to each other in the suffix array of S. Indeed, for S = parallel, the suffixes l, lel, and llel (all the suffixes that begin with P = l) are adjacent. Given this setup, our task boils down to locating where this contiguous block starts, which can again be done with a binary search: it requires O(log |S|) string comparisons since the suffix array of S contains |S| entries, and each comparison can be performed in O(|P|) work, for O(|P| log |S|) total work.

Remarks: With a bit more work, this bound can be improved to O(|P| + log |S|). For comparison, the most efficient string-searching algorithms (e.g., Knuth-Morris-Pratt and Rabin-Karp) find the occurrences of a pattern P in a string S in O(|S| + |P|) time, so preprocessing the string S can be a big win if we are performing multiple searches.
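As a concrete illustration of this search, here is a Python sketch (ours, not from the notes; the helper names are made up) that uses two binary searches over the suffix array to report every occurrence of P in S:

  # Report all positions where pattern P occurs in S, using S's suffix array SA.
  # Two binary searches find the contiguous block of suffixes that start with P;
  # each step compares P against at most |P| characters of one suffix, so the
  # total work is O(|P| log |S|).
  def find_all(S, SA, P):
      def prefix(i):                     # the first |P| characters of suff_i(S)
          return S[i:i + len(P)]

      lo, hi = 0, len(SA)                # leftmost suffix with prefix >= P
      while lo < hi:
          mid = (lo + hi) // 2
          if prefix(SA[mid]) < P:
              lo = mid + 1
          else:
              hi = mid
      start = lo

      lo, hi = start, len(SA)            # leftmost suffix with prefix > P
      while lo < hi:
          mid = (lo + hi) // 2
          if prefix(SA[mid]) <= P:
              lo = mid + 1
          else:
              hi = mid

      return sorted(SA[start:lo])

  S, SA = "parallel", [3, 1, 6, 7, 5, 4, 0, 2]
  print(find_all(S, SA, "l"))            # prints [4, 5, 7]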

1.2 How to Build A Suffix Array?

We have just seen an application of suffix arrays. How can we build one, given an input string? As the definition suggests, the most straightforward way is to enumerate all the suffixes of the string and sort them. On an input string of length n, the algorithm performs O(n log n) comparisons (if an efficient comparison-based sorting algorithm is used), but unfortunately each comparison is rather costly: comparing two strings of length n costs O(n) work. Therefore, the overall work of this simple approach is O(n^2 log n), which is too slow for large inputs.

Exercise 1. Given two strings S and T of length n, what are the work and span of comparing them? What does this say about the span of the naive algorithm for suffix arrays?

2 Building A Suffix Array in Less Than O(n^2) Work

One thing is clear about the above algorithm: we are doing a lot of redundant work comparing strings. While sorting n arbitrary length-n strings in the comparison model needs Ω(n^2 log n) work, the strings we are sorting are not arbitrary: they are the suffixes of a single string. In this part of the lecture, we will focus on reducing the work spent comparing strings. The goal is to build a suffix array in O(n log^2 n) work. The current state of the art is O(n) work (with a parallel algorithm, too), but that is beyond the scope of this class.

Say we want to compare two strings X = x_0 x_1 ... x_{n-1} and Y = y_0 y_1 ... y_{m-1}, of lengths n and m. The simplest implementation costs O(min(n, m)) work: it compares x_0 against y_0 and returns an answer if they differ; otherwise, it compares x_1 against y_1, and so on, until it finds a position where the two strings differ. In code, we have:

  fun strcmp (X, Y) =
    case (X, Y) of
      (nil, nil) => EQUAL
    | (nil, _)   => LESS
    | (_, nil)   => GREATER
    | (x::X', y::Y') =>
        (case compare (x, y) of
           EQUAL => strcmp (X', Y')
         | r => r)
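For readers following along in Python rather than SML, here is a direct transcription of the same character-by-character comparison (a sketch; in practice Python's built-in string comparison already behaves this way):

  # Compare X and Y character by character, returning LESS (-1), EQUAL (0),
  # or GREATER (1); the work is O(min(|X|, |Y|)), as discussed above.
  def strcmp(X, Y):
      for x, y in zip(X, Y):
          if x != y:
              return -1 if x < y else 1
      # One string is a prefix of the other (or they are equal).
      return (len(X) > len(Y)) - (len(X) < len(Y))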

For a moment, assume that X and Y have the same length (i.e., n = m). Also pretend that we have already compared suff_1(X) and suff_1(Y). With these assumptions, if we have access to the outcome r = strcmp(suff_1(X), suff_1(Y)), then we can compare X with Y in O(1) work: if x_0 = y_0, return r; otherwise, return compare(x_0, y_0). But to take advantage of this idea, we would need to compute a lot of such partial results, most likely O(n^2) of them, so we have not saved anything yet.

The idea is to generalize this approach further. We will experiment with divide and conquer. For simplicity, we still assume that n = m; in addition, we assume that n = m = 2^k is a power of two. Then we can write X and Y as

  X = X_0 X_1
  Y = Y_0 Y_1,

where X_0 = x_0 x_1 ... x_{n/2 - 1}, X_1 = x_{n/2} ... x_{n-1}, and Y_0 and Y_1 are defined similarly for Y. We have:

  Observation: strcmp(X, Y) = case strcmp(X_0, Y_0) of
                                EQUAL => strcmp(X_1, Y_1)
                              | r => r

This leads to the following pseudocode implementation, which seems a bit convoluted and does not yet help us in any way:

  fun strcmp (X as X_0 X_1, Y as Y_0 Y_1) =
    case (X, Y) of
      ...  (* handle base cases *)
    | _ => (case strcmp (X_0, Y_0) of
              EQUAL => strcmp (X_1, Y_1)
            | r => r)

How might we reuse some of the results to speed up the construction of suffix arrays? Let's try the following simple idea: split every suffix in half and hope that these substrings have been compared before. On our running example of S = parallel, we would have the following (we use the special symbol $, which is lexicographically smaller than any real symbol, to pad every suffix back to the full length):

  para llel
  aral lel$
  rall el$$
  alle l$$$
  llel $$$$
  lel$ $$$$
  el$$ $$$$
  l$$$ $$$$
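The table above can be generated mechanically; the following Python sketch (ours, purely illustrative) pads each suffix of S with $ back to length n and splits it into two halves:

  # Pad every suffix of "parallel" with '$' to length n and split into halves.
  # '$' is assumed to be lexicographically smaller than any real character,
  # which holds for ASCII letters since ord('$') < ord('a').
  S = "parallel"
  n = len(S)
  for i in range(n):
      padded = S[i:] + "$" * i             # suffix i, padded back to length n
      print(padded[:n // 2], padded[n // 2:])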

We quickly realize that if we are to do less work than O(n^2), we could not have compared all combinations of these length-(n/2) substrings before; in fact, even if we could, we would not have the space or time to store all that information. There are simply too many pairs. Here is where we need a new idea: instead of storing the outcome of every comparison, we will rank these substrings in such a way that comparing ranks is equivalent to comparing the actual strings. Let rank_i(S, l) denote the rank of the length-l substring of S starting at position i (i.e., the substring S[i..i+l-1]). The guarantee we make is the following: for any length l > 0,

  Int.compare(rank_i(S, l), rank_j(S, l)) = String.compare(S[i..i+l-1], S[j..j+l-1])     (1)

One possibility is to give the smallest string rank 0, the second smallest rank 1, and so on. Using this idea, we would assign the ranking:

  para: 6    llel: 5
  aral: 1    lel$: 4
  rall: 7    el$$: 2
  alle: 0    l$$$: 3
  llel: 5    $$$$: -1
  lel$: 4    $$$$: -1
  el$$: 2    $$$$: -1
  l$$$: 3    $$$$: -1

Sorting the Suffixes of S Using Rank. Before we talk about computing the rank function, let's see how we can use the rank information to sort all the suffixes of a string S. Say we have the rank information for all length-(|S|/2) substrings of S, as in the example above. We can then sort the suffixes of S using the following two-step process (a code sketch of this step appears right after the list):

1. Make the sequence ⟨ (i, (rank_i(S, |S|/2), rank_{i+|S|/2}(S, |S|/2))) : i = 0, ..., |S| - 1 ⟩.
2. Sort this sequence by the rank pairs.
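Here is that two-step process in Python (a sketch that is not in the original notes; the half_rank values are simply copied from the table above):

  # Ranks of the length-4 (= n/2) padded substrings of S = "parallel", read off
  # the table above: entry i is rank_i(S, 4); positions i >= n correspond to
  # all-'$' substrings and get rank -1.
  half_rank = [6, 1, 7, 0, 5, 4, 2, 3, -1, -1, -1, -1]
  n = 8

  # Step 1: pair each suffix i with (rank_i(S, n/2), rank_{i+n/2}(S, n/2)).
  pairs = [(i, (half_rank[i], half_rank[i + n // 2])) for i in range(n)]

  # Step 2: sort by the rank pairs; reading off the indices gives the suffix array.
  pairs.sort(key=lambda p: p[1])
  print([i for i, _ in pairs])             # prints [3, 1, 6, 7, 5, 4, 0, 2]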

2.1 How to Compute the Rank Function?

The idea is to construct the rank function recursively. To pursue this idea, two questions need to be answered:

1. Base case. How can we efficiently compute the rank function for l = 1?
2. Inductive case. How can we quickly derive the rank function for l > 1 from the rank function for some l' < l? Presumably, we want an even split (e.g., l' = l/2) to avoid O(n)-depth recursion.

We'll start with the base case. More concretely, the task is: given a length-n sequence of characters (i.e., the input has type char seq), construct a sequence R of length n such that R[i] is an integer denoting the rank of the character at position i of the input. The simplest approach (as suggested in class) is to use the ASCII value of each character. It is easy to check that this satisfies the rank requirement, Equation (1). In code, we can write the following:

  fun rankBase (S : char seq) : int seq =
    tabulate (fn i => Char.ord (nth S i)) (length S)

We'll also consider a slightly more complicated approach, which hints at what we'll need to do in the recursive case. This alternative has the added advantages that the rank numbers are between 0 and |S| - 1 and that it works with black-box characters (you can compare elements, but you have no other information about their ordering). The following code captures the essence of the rank function in the base case:

  % compute the rank sequence where the i-th element is the rank of S[i]
  fun rankBase (S) : int seq =
    let
      val pairs       = ⟨ (i, S[i]) : i = 0, ..., |S| - 1 ⟩
      val sortedPairs = sort (Char.compare o second) pairs
      val newRank     = ⟨ (#1 sortedPairs[i], i) : i = 0, ..., |S| - 1 ⟩
    in
      inject newRank ⟨ -1 : i = 0, ..., |S| - 1 ⟩
    end

It first sorts all the characters and then uses each character's index in the sorted sequence as its rank. The sequence newRank stores ordered pairs (position, rank), so we can use inject to derive the desired sequence. But as the following example shows, this does not correctly handle repeated characters:

  p a r a l l e l
    => ⟨ (0,p), (1,a), (2,r), (3,a), (4,l), (5,l), (6,e), (7,l) ⟩
  sort
    => ⟨ (1,a), (3,a), (6,e), (4,l), (5,l), (7,l), (0,p), (2,r) ⟩
  assign ranking (position, rank):
    =? ⟨ (1,0), (3,1), (6,2), (4,3), (5,4), (7,5), (0,6), (2,7) ⟩
  actually, we want:
    == ⟨ (1,0), (3,0), (6,2), (4,3), (5,3), (7,3), (0,6), (2,7) ⟩

What we need to do is make sure that if s and t are the same character, they get the same rank number. Since the sequence is already sorted, this is easily done with a scan (see the earlier lectures on scan).
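In Python, the sort-then-rank idea for the base case looks as follows (our sketch, not the notes' code); the "scan" is rendered here as a sequential pass that bumps the rank only when the character changes, so duplicates share a rank:

  # Base-case rank: R[i] is the rank of character S[i], and equal characters
  # get equal ranks, as Equation (1) requires.  Only comparisons are used,
  # so it works for black-box characters.
  def rank_base(S):
      n = len(S)
      order = sorted(range(n), key=lambda i: S[i])   # positions sorted by character
      R = [0] * n
      r = 0
      for k in range(1, n):
          if S[order[k]] != S[order[k - 1]]:
              r = k                                  # new character: new, larger rank
          R[order[k]] = r
      return R

  print(rank_base("parallel"))   # prints [6, 0, 7, 0, 3, 3, 2, 3]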

Inductive Case. Building on the ideas laid out so far, we represent each length-l substring as an ordered pair of rank numbers: that is, we represent the substring S[i..i+l-1] by the pair (rank_i(S, l/2), rank_{i+l/2}(S, l/2)) (we are still assuming l is a power of two). This conversion lets us sort the length-l substrings in O(n log n) work, because each comparison is now between a pair of integer pairs, which takes constant work, and an efficient sorting algorithm needs only O(n log n) comparisons. This leads to the following pseudocode:

  % compute the rank sequence where the i-th element is the rank of S[i..i + l - 1]
  fun rank (S, l) : int seq =
    if l = 1 then rankBase (S) else
    let
      val rank'       = rank (S, l/2)
      fun rnk i       = if i >= |S| then -1 else rank'[i]
      val pairs       = ⟨ (i, (rnk i, rnk (i + l/2))) : i = 0, ..., |S| - 1 ⟩
      val sortedPairs = sort (intPairCmp o second) pairs
      val newRank     = ⟨ (#1 sortedPairs[i], i) : i = 0, ..., |S| - 1 ⟩
    in
      inject newRank ⟨ -1 : i = 0, ..., |S| - 1 ⟩
    end

As in the base case, the sequence newRank contains ordered pairs (position, rank), so the inject function can be used to construct the desired result; however, the same problem that arose with duplicate characters in the base case also shows up here, now with duplicate substrings.

Exercise 2. The code as written is broken. Fix the computation of newRank for the case where the substrings contain duplicates.

Analysis. It is not difficult to see that each call to rank spends O(n log n) work on sorting and other bookkeeping and, for l > 1, makes one recursive call with l halved. Therefore, we have the recurrence

  W(l) = W(l/2) + O(n log n),    with W(1) = O(n log n)

(or W(1) = O(n) if we use the first, ASCII-based construction for the base case). This solves to W(n) = O(n log^2 n), which is better than O(n^2), as promised.

Stay tuned for an SML/NJ implementation of this (as well as a C/C++ implementation) in the Git repository.
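Putting the pieces together, here is a compact Python sketch of the whole O(n log^2 n) construction (our illustration, not the promised SML/NJ implementation); it resolves Exercise 2 by using the same scan-style duplicate handling as in the base case:

  # Suffix array by repeated rank doubling: O(log n) rounds, each dominated by
  # an O(n log n) comparison sort, for O(n log^2 n) work in total.
  def suffix_array(S):
      n = len(S)
      if n <= 1:
          return list(range(n))
      rank = [ord(c) for c in S]        # base case: ranks of length-1 substrings
      l = 1
      while l < n:
          # Represent the length-2l substring at i by a pair of length-l ranks;
          # positions past the end act like '$' padding and get rank -1.
          key = lambda i: (rank[i], rank[i + l] if i + l < n else -1)
          order = sorted(range(n), key=key)
          # Scan: equal pairs receive equal ranks, as Equation (1) requires.
          new_rank = [0] * n
          r = 0
          for k in range(1, n):
              if key(order[k]) != key(order[k - 1]):
                  r = k
              new_rank[order[k]] = r
          rank, l = new_rank, 2 * l
      # The final ranks are distinct, i.e., a permutation of 0..n-1; inverting
      # that permutation yields the suffix array.
      SA = [0] * n
      for i, r in enumerate(rank):
          SA[r] = i
      return SA

  print(suffix_array("parallel"))       # prints [3, 1, 6, 7, 5, 4, 0, 2]

The sequential sort here stands in for the parallel comparison sort assumed in the analysis above; the recurrence and the O(n log^2 n) work bound are unchanged.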